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Abstract.  The  formulation  minx,y  f(x)  +  g(y)  subject  to  Ax  +  By  =  b  arises  in  many  application  areas  such  as  signal 
processing,  imaging  and  image  processing,  statistics,  and  machine  learning  either  naturally  or  after  variable  splitting.  In  many 
common  problems,  one  of  the  two  objective  functions  is  strongly  convex  and  has  Lipschitz  continuous  gradient.  On  this  kind 
of  problem,  a  very  effective  approach  is  the  alternating  direction  method  of  multipliers  (ADM,  also  known  as  ADMM),  which 
solves  a  sequence  of  f  / g- decoupled  subproblems.  However,  its  effectiveness  has  not  been  matched  by  a  provably  fast  rate  of 
convergence;  only  sublinear  rates  such  as  0(1/ k)  and  0(1/ k2)  were  recently  established  in  the  literature,  though  these  rates  do 
not  require  strong  convexity.  This  paper  shows  that  global  linear  convergence  can  be  guaranteed  under  the  above  assumptions 
on  strong  convexity  and  Lipschitz  gradient  on  one  of  the  two  functions,  along  with  certain  rank  assumptions  on  A  and  B.  The 
result  applies  to  the  generalized  ADMs  that  allow  the  subproblems  to  be  solved  faster  and  less  exactly  in  certain  manners.  In 
addition,  the  rate  of  convergence  provides  some  theoretical  guidance  for  optimizing  the  ADM  parameters. 
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1.  Introduction.  The  alternating  direction  method  of  multipliers  (ADM  or  ADMM)  is  very  effective 
at  solving  many  practical  optimization  problems  and  has  wide  applications  in  areas  such  as  signal  and  image 
processing,  machine  learning,  statistics,  compressive  sensing,  and  operations  research.  We  refer  to  [1-10] 
for  a  few  examples  of  applications.  The  ADM  is  applied  to  constrained  convex  optimization  problems  with 
separable  objective  functions  in  the  following  form 


min  /( x)  +  g(y) 
x,y 

s.t.  Ax  +  By  =  6, 


(i.i) 


where  x  £  Mn  and  y  £  Rm  are  unknown  variables,  A  £  Rpxn  and  B  £  ffi. pxm  are  given  matrices,  and 
/  :  R"  ->MU  {+oo}  and  g  :  Rm  ->  ill  {+oo}  are  closed  proper  convex  functions.  Some  original  problems 
are  not  in  the  form  of  (1.1),  but  after  introducing  variables  and  constraints,  they  become  the  form  of  (1.1). 
For  example,  introducing  y  =  Ax,  problem  min Xf(x)  +  g(Ax)  is  transformed  to  (1.1)  with  B  =  —I  and 
6  =  0.  Constraints  x  £  X  and  y  £  y,  where  X  C  Rn  and  y  C  Rm  are  closed  convex  sets,  can  be  included 
as  the  (extended- value)  indicator  functions  Ix(%)  and  Iy  (y)  in  the  objective  functions  /  and  g.  Here  the 
indicator  function  of  a  convex  set  C  is  defined  by 


Ic{z) 


0  if  z  £  C, 

oo  if  z  (jiC. 


(1.2) 


The  main  goal  of  this  paper  is  to  show  that  the  ADM  applied  to  (1.1)  has  global  linear  convergence 
provided  that  /  is  strongly  convex,  V/  is  Lipschitz  continuous,  and  matrices  A  and  B  have  certain  rank  con¬ 
ditions.  The  convergence  analysis  is  performed  under  a  general  framework  that  allows  the  ADM  subproblems 
to  be  solved  inexactly  and  faster. 


1.1.  Background.  The  classic  ADM  was  first  introduced  in  [11, 12].  Consider  the  augmented  La¬ 
grangian  function  of  (1.1): 


£a{x,  y,  A)  =  f(x)  +  g(y)  -  XT  (Ax  +  By  -  b)  +  ^  ||  Ax  +  By-b\\j, 


(1.3) 


*  Department  of  Computational  and  Applied  Mathematics,  Rice  University,  Houston,  TX  77005  ({wei.deng, 
wotao.yin}@rice.edu) 


1 


Report  Documentation  Page 


Form  Approved 
OMB  No.  0704-0188 


Public  reporting  burden  for  the  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information, 
including  suggestions  for  reducing  this  burden,  to  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports,  1215  Jefferson  Davis  Highway,  Suite  1204,  Arlington 
VA  22202-4302.  Respondents  should  be  aware  that  notwithstanding  any  other  provision  of  law,  no  person  shall  be  subject  to  a  penalty  for  failing  to  comply  with  a  collection  of  information  if  it 
does  not  display  a  currently  valid  OMB  control  number. 


1.  REPORT  DATE 

AUG  2012 


2.  REPORT  TYPE 


4.  TITLE  AND  SUBTITLE 

On  the  Global  and  Linear  Convergence  of  the  Generalized  Alternating 
Direction  Method  of  Multipliers 

6.  AUTHOR(S) 


7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

Rice  University, Department  of  Computational  and  Applied 
Mathematics, Houston, TX, 77005 

9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 


3.  DATES  COVERED 

00-00-2012  to  00-00-2012 

5a.  CONTRACT  NUMBER 

5b.  GRANT  NUMBER 

5c.  PROGRAM  ELEMENT  NUMBER 

5d.  PROJECT  NUMBER 

5e.  TASK  NUMBER 

5f.  WORK  UNIT  NUMBER 

8.  PERFORMING  ORGANIZATION 
REPORT  NUMBER 


10.  SPONSOR/MONITOR'S  ACRONYM(S) 

11.  SPONSOR/MONITOR'S  REPORT 
NUMBER(S) 


12.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Approved  for  public  release;  distribution  unlimited 

13.  SUPPLEMENTARY  NOTES 

14.  ABSTRACT 

The  formulation  minx;y  f(x)  +  g(y)  subject  to  Ax  +  By  =  b  arises  in  many  application  areas  such  as  signal 
processing,  imaging  and  image  processing,  statistics,  and  machine  learning  either  naturally  or  after 
variable  splitting.  In  many  common  problems,  one  of  the  two  objective  functions  is  strongly  convex  and  has 
Lipschitz  continuous  gradient.  On  this  kind  of  problem,  a  very  e  ective  approach  is  the  alternating 
direction  method  of  multipliers  (ADM,  also  known  as  ADMM),  which  solves  a  sequence  of  f/g-decoupled 
subproblems.  However,  its  e  ectiveness  has  not  been  matched  by  a  provably  fast  rate  of  convergence;  only 
sublinear  rates  such  as  0(l=k)  and  0(l=k2)  were  recently  established  in  the  literature,  though  these  rates 
do  not  require  strong  convexity.  This  paper  shows  that  global  linear  convergence  can  be  guaranteed  under 
the  above  assumptions  on  strong  convexity  and  Lipschitz  gradient  on  one  of  the  two  functions,  along  with 
certain  rank  assumptions  on  A  and  B.  The  result  applies  to  the  generalized  ADMs  that  allow  the 
subproblems  to  be  solved  faster  and  less  exactly  in  certain  manners.  In  addition,  the  rate  of  convergence 
provides  some  theoretical  guidance  for  optimizing  the  ADM  parameters. 

15.  SUBJECT  TERMS 


16.  SECURITY  CLASSIFICATION  OF: 

17.  LIMITATION  OF 

18.  NUMBER 

19a.  NAME  OF 

ABSTRACT 

OF  PAGES 

RESPONSIBLE  PERSON 

a.  REPORT 

unclassified 

b.  ABSTRACT 

unclassified 

c.  THIS  PAGE 

unclassified 

Same  as 
Report  (SAR) 

21 

Standard  Form  298  (Rev.  8-98) 

Prescribed  by  ANSI  Std  Z39-18 


where  A  £  Rp  is  the  Lagrangian  multiplier  vector  and  /3  >  0  is  a  penalty  parameter.  The  classic  augmented 
Lagrangian  method  (ALM)  minimizes  CA(x,y,X)  over  x  and  y  jointly  and  then  updates  A.  However,  the 
ADM  replaces  the  joint  minimization  by  minimization  over  x  and  y,  one  after  another,  as  described  in 
Algorithm  1.  Compared  to  the  ALM,  though  the  ADM  may  take  more  iterations,  it  often  runs  faster  due 
to  the  easier  subproblems. 


Algorithm  1:  Classic  ADM 


l  Initialize  x°,  A0,  /3  >  0; 


2  for  k  =  0,  1, . . .  do 


3 

4 

5 


yk+ 1  =  arg miny  CA(xk,y,  Afc); 
xk+1  =  argmiOj,  CA(x,  yk+1,  Afe); 
Xk+i  =  Xk  _  p(Axk+ 1  +  Byk+1  - 


b )• 


Although  there  is  extensive  literature  on  the  ADM  and  its  applications,  there  are  very  few  results  on 
its  rate  of  convergence  until  the  very  recent  past.  Work  [13]  shows  that  for  a  Jacobi  version  of  the  ADM 
applied  to  smooth  functions  with  Lipschitz  continuous  gradients,  the  objective  value  descends  at  the  rate 
of  0{l/k)  and  that  of  an  accelerated  version  descends  at  0(l/k2).  Then,  work  [14]  establishes  the  same 
rates  on  a  Gauss-Seidel  version  and  requires  only  one  of  the  two  objective  functions  to  be  smooth  with 
Lipschitz  continuous  gradient.  Lately,  work  [15]  shows  that  \\uk  —  Mfe+1||,  where  uk  :=  (xk,yk,Xk),  of  the 
ADM  converges  at  0(1/ A;)  assuming  at  least  one  of  the  subproblems  is  exactly  solved.  Work  [16]  proves 
that  the  dual  objective  value  of  a  modification  to  the  ADM  descends  at  0(1/ A;2)  under  the  assumption  that 
the  objective  functions  are  strongly  convex  (one  of  them  being  quadratic)  and  both  subproblems  are  solved 
exactly.  We  show  the  linear  rate  of  convergence  0(l/cfe)  for  some  c  <  1  under  a  variety  of  scenarios  in  which 
at  least  one  of  the  two  objective  functions  is  strongly  convex  and  has  Lipschitz  continuous  gradient.  This 
rate  is  stronger  than  the  sublinear  rates  such  as  0(l/k)  and  0(l/k2)  and  is  given  in  terms  of  the  solution 
error,  which  is  stronger  than  those  given  in  terms  of  the  objective  error  in  [13, 14, 16]  and  the  solution  relative 
change  in  [15].  On  the  other  hand,  [13-15]  do  not  require  any  strongly  convex  functions.  The  fact  that  a 
wide  range  of  applications  give  rise  to  model  (1.1)  with  at  least  one  strongly  convex  functions  has  motivated 
this  work. 

During  the  final  polishing  of  this  paper,  we  noticed  the  work  [17]  through  private  communication,  which 
proves  the  linear  convergence  of  ADM  in  a  different  approach.  The  analysis  in  [17]  requires  that  the  objective 
function  is  smooth  and  the  step  size  for  updating  the  multiplier  is  sufficiently  small,  while  no  explicit  linear 
rate  is  given.  On  the  other  hand,  it  allows  more  than  two  blocks  of  separable  variables  and  it  does  not 
require  strong  convexity. 

It  is  worth  mentioning  that  the  ADM  applied  to  linear  programming  is  known  to  converge  at  a  global 
linear  rate  [18].  For  quadratic  programming,  work  [19]  presents  an  analysis  leading  to  a  conjecture  that  the 
ADM  should  converge  linearly  near  the  optimal  solution.  Our  analysis  in  this  paper  is  different  from  those 
in  [18,19], 

Variants  of  Algorithm  1  that  allow  CA  to  be  inexactly  minimized  over  x  or  y  are  very  important  to  the 
applications  in  which  it  is  expensive  to  exactly  solve  either  the  x-subproblem  or  the  y-subproblem,  or  both 
of  them.  For  this  reason,  we  present  Algorithm  2  below,  which  is  more  general  than  Algorithm  1  by  allowing 
easier  subproblems.  Our  results  are  established  for  Algorithm  2. 
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Algorithm  2:  Generalized  ADM 


1  Choose  Q  >z  0  and  a  symmetric  matrix  P.  Initialize  x°,  A0,  j3  >  0,7  >  0; 


2  for  k  =  0,  1, . . .  do 


3 

4 

5 


yk+ 1  =  argminy£4(xfe,y,Afc)  +  \(y  ~  yk)T Q(;y  ~  yk)\ 
xk+1  =  argmiix,,  C^(x,  yk+1,Xk)  +  |(a;  —  xk)T P(x  —  xk)] 
\k+ 1  =  Afc  -  7/3(Axfc+1  +  -  6). 


Compared  to  Algorithm  1,  Algorithm  2  adds  |||y  —  2/fe  II Q  and  \\\x  —  xk\\p  to  the  y-  and  ir-subproblems, 
respectively,  and  assigns  7  as  the  step  size  for  the  update  of  A.  Here,  we  use  the  notion  \\x\\2M  :=  xT Mx.  If 
M  >-  0,  ||x||m  is  a  norm,  but  we  abuse  the  notation  by  allowing  any  symmetric  matrix  M.  Different  choices 
of  P  and  Q  are  overviewed  in  the  next  subsection.  They  can  make  steps  3  and  4  of  Algorithm  2  easier  than 
those  of  Algorithm  1. 

We  do  not  fix  7  =  1  like  in  most  of  the  ADM  literature  since  7  plays  an  important  role  in  convergence 
and  speed.  For  example,  when  P  =  0  and  Q  =  0,  any  7  £  (0,  (x/5  +  l)/2)  guarantees  the  convergence  of 
Algorithm  2  [20],  but  7  =  1.618  makes  it  converge  noticeably  faster  than  7=1.  The  range  of  7  depends  on 
P  and  Q,  as  well  as  /3.  When  P  is  indefinite,  7  must  be  smaller  than  1  or  the  iteration  may  diverge. 

Let  us  overview  two  works  very  related  to  Algorithm  2.  Work  [21]  considers  (1.3)  where  the  quadratic 
penalty  term  is  generalized  to  ||  Ax  +  By  —  b\\2H  for  a  sequence  of  bounded  positive  definite  matrices  { Hk }, 
and  the  work  proves  the  convergence  of  Algorithm  2  restricted  to  7  =  1  and  differential  functions  /  and  g. 
Work  [22]  replaces  7  by  a  general  positive  definite  matrix  C  and  establishes  convergence  assuming  that  A  =  I 
and  the  smallest  eigenvalue  of  C  is  no  greater  than  1,  which  corresponds  to  7  <  1  when  C  =  7 1.  In  these 
works,  both  P  and  Q  are  restricted  to  positive  semi-definite  matrices,  and  there  is  no  rate  of  convergence 
given. 

In  addition  to  deriving  linear  convergence  rates,  this  paper  makes  meaningful  extensions  to  the  existing 
convergence  theory  of  Algorithm  2.  Specifically,  the  step  size  7  is  less  restrictive,  and  P  is  allowed  to  be 
indefinite.  These  extensions  translate  to  faster  convergence  and  more  options  of  solving  the  s-subproblem 
efficiently. 

1.2.  Inexact  ADM  subproblems.  By  “inexact”,  we  mean  that  the  ADM  subproblems  in  Algorithm  1 
are  replaced  by  their  approximations  that  are  easier  to  solve.  We  do  not  consider  the  errors  in  the  subproblem 
solutions  due  to  finite-precision  arithmetics  or  early-stopping  of  a  subproblem  solver. 

Let  us  give  a  few  examples  of  matrix  P  in  step  4  of  Algorithm  2.  These  examples  also  apply  to  Q  in 
step  3.  Note  that  P  and  Q  can  be  different. 

Prox-linear  [23].  Setting 

p  =  -i  -  PATA 
r 

gives  rise  to  a  prox-linear  problem  at  step  4  of  Algorithm  2: 

min /(*)  +  (3  (Vf  (*  -  xk )  +  ±-\\x  -  xk\\fj  ,  (1.4) 

where  r  >  0  is  a  proximal  parameter  and  hk  :=  AT(Axk  +  Byk+1  —b  —  Xk //3)  is  the  gradient  of  the  last  two 
terms  of  (1.3)  at  x  =  xk . 

This  P  makes  step  4  much  easier  to  compute  in  various  applications.  For  example,  if  /  is  a  separable 
function,  problem  (1.4)  reduces  to  a  set  of  independent  one-dimensional  problems.  In  particular,  if  /  is  £± 
norm,  the  solution  is  given  in  the  closed  form  by  so-called  soft-thresholding.  If  /  is  the  matrix  nuclear  norm, 
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then  singular-value  soft-thresholding  is  used.  If  /(: x)  =  ||4>a;||i  where  $  is  an  orthogonal  operator  or  a  tight 
frame,  (1.4)  also  has  a  closed-form  solution.  If  /  is  total  variation,  (1.4)  can  be  solved  by  graph-cut  [24,25]. 
There  are  a  large  number  of  such  examples  in  signal  processing,  imaging,  statistics,  machine  learning,  etc. 
Gradient  descent.  When  function  /  is  quadratic ,  letting 

P  =  ^1  -  Hf  ~  (3AtA,  Hf  :=  V2f(x)  h  0, 

gives  rise  to  a  gradient  descent  step  for  step  4  of  Algorithm  2  since  the  problem  becomes 

min  (gk)T(x-xk)  +  ±\\x  -  xk\\l,  (1.5) 

x  la 

where  gk  :=  Vf(xk)  +  /3AT(Axk  +  Byk+1  ~b  -  X  k//3)  is  the  gradient  of  CA(x,yk+\Xk)  at  x  =  xk.  The 
solution  is 


fc+i 


=  Xk  -  agk , 


(1.6) 


where  a  >  0  is  obviously  the  step  size.  When  step  4  of  Algorithm  1  must  solve  a  large,  nontrivial  linear 
system,  taking  the  gradient  step  has  a  clear  advantage. 

Approximating  AT A.  The  term  ^\\Ax+By—b\\\  in  CA(x,  y,  X)  contains  the  quadratic  term  | xT AT Ax . 
Sometimes,  replacing  AT A  by  a  certain  D  ss  Ar A  makes  step  4  (much)  easier  to  compute.  Then  one  can  let 


P  =  p(D  -  AtA). 


The  choice  of  P  effectively  turns  ixT  AT Ax  into  ^xT Dx  since 


—  1 1 Ax  +  By  —  fo|| |  +  ^\\x  —  xk\\ =  ^xT Dx  +  [terms  linear  in  x]  +  [terms  independent  of  x\. 


This  approach  is  useful  when  AT A  is  nearly  diagonal  ( D  is  the  diagonal  matrix),  or  is  an  orthogonal  matrix 
plus  error  ( D  is  the  orthogonal  matrix),  as  well  as  when  an  off-the-grid  operator  A  can  be  approximated 
by  its  on-the-grid  counterpart  that  has  very  fast  implementations  (e.g.,  the  discrete  Fourier  transforms  and 
FFT).  Note  that  P  can  be  indefinite. 

Goals  of  P  and  Q.  The  general  goal  is  to  wisely  choose  P  so  that  step  4  of  Algorithm  2  becomes  much 
easier  to  carry  out  and  the  entire  algorithm  runs  in  less  time.  The  same  applies  to  Q  of  step  3  of  Algorithm  2 
except  we  require  <5^0  for  provable  convergence.  In  the  ADM,  the  two  subproblems  can  be  solved  in  either 
order  (but  fixed  throughout  the  iterations).  However,  when  one  subproblem  is  solved  less  exactly  than  the 
other,  Algorithm  2  tends  to  run  faster  if  the  less  exact  one  is  solved  later  —  assigned  as  step  4  of  Algorithm 
2  —  because  at  each  iteration,  the  ADM  updates  the  variables  in  the  Gauss-Seidel  fashion.  If  the  less  exact 
one  runs  first,  its  relatively  inaccurate  solution  will  then  affect  the  more  exact  step,  making  its  solution  also 
inaccurate.  Since  the  less  exact  subproblem  should  be  assigned  as  the  later  step  4,  more  choices  of  P  are 
needed  than  Q ,  which  is  the  case  in  this  paper. 


1.3.  Summary  of  results.  Table  1.1  summarizes  the  four  scenarios  under  which  we  study  the  linear 
convergence  of  Algorithm  2,  and  Table  1.2  specifies  the  linear  convergent  quantities  for  different  types  of 
matrices  P,  P,  and  Q,  where 


P--P  +  p  AtA 


is  defined  for  the  convenience  of  convergence  analysis.  P  =  0  and  Q  =  0  correspond  to  exactly  solving  the 
x-  and  y- subproblems,  respectively.  Although  P  =  0  and  P  >-  0  are  different  cases  in  Table  1.2,  they  may 
happen  at  the  same  time  if  A  has  full  column  rank;  if  so,  apply  the  result  under  P  >-  0,  which  is  stronger. 
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Table  1.1 

Four  scenarios  leading  to  linear  convergence 


scenario 

strongly 

convex 

Lipschitz 

continuous 

full  row 

rank 

remark 

i 

/ 

v/ 

A 

if  Q  y  0,  B  has  full  column  rank 

2 

fi9 

v/ 

A 

3 

f 

V/,V5 

[A,B] 

B  has  full  column  rank 

4 

ft  9 

v/,vs 

[A,B] 

Table  1.2 

Summary  of  linear  convergence  results 


case 

P,P 

Q 

any  scenario  1-4 

Q-linear  convergence 

R-linear  convergence* 

i 

P  =  0 

=  0 

(Axk,Xk) 

2 

P  y  0 

=  0 

(xk,Xk) 

„fc  ,,/c  \k 

x  ,y  ,  A 

3 

P  =  0 

>-  0 

(. Axk,yk,Xk ) 

4 

P  y  0 

>-  0 

(xk,yk,  Xk) 

*  In  cases  1  and  2,  scenario  1,  R-linear  convergence  yk  requires  full 
column  rank  of  B\  otherwise,  only  Byk  has  R-linear  convergence 


The  conclusions  in  Table  1.2  are  the  quantities  that  converge  either  Q-linearly  or  R-linearly1.  Q-linear 
convergent  quantities  are  the  entireties  of  multiple  variables  whereas  R-linear  convergent  quantities  are  the 
individual  variables  xk ,  yk ,  and  Xk. 

Four  scenarios.  In  scenario  1,  only  function  /  needs  to  be  strongly  convex  and  having  Lipschitz 
continuous  gradient;  there  is  no  assumption  on  g  besides  convexity.  On  the  other  hand,  matrix  A  must  have 
full  row  rank.  Roughly  speaking,  the  full  row  rank  of  A  makes  sure  that  the  error  of  Xk  can  be  bounded 
just  from  the  x-side  by  applying  the  Lipschitz  continuity  of  V/.  One  cannot  remove  this  condition  or  relax 
it  to  the  full  row  rank  of  [A,  B]  without  additional  assumptions.  Consider  the  example  of  A  =  [1;  0]  and 
B  =  [0;  1],  where  [ A ,  B\  =  I  has  full  rank.  Since  Xk  is  not  affected  by  /  or  {xfc}  at  all,  there  is  no  way  to 
take  advantages  of  the  Lipschitz  continuity  of  V/  to  bound  the  error  of  Xk.  In  general,  without  the  full  row 
rank  of  A,  a  part  of  Xk  needs  to  be  controlled  from  the  y-side  using  properties  of  g. 

Scenario  2  adds  the  strong  convexity  assumption  on  g.  As  a  result,  the  remark  in  case  1  regarding  the 
full  column  rank  of  B  is  no  longer  needed. 

Both  scenarios  3  and  4  assume  that  g  is  differentiable  and  Vg  is  Lipschitz  continuous.  As  a  result,  the 
error  of  Xk  can  be  controlled  by  taking  advantages  of  the  Lipschitz  continuity  of  both  V/  and  Vg,  and  the 
full  row  rank  assumption  on  A  is  relaxed  to  that  on  [A,  B\,  which  is  a  common  assumption  since  otherwise 
certain  rows  of  [A,  B]  can  be  eliminated  without  changing  the  solution  (assuming  that  Ax  +  By  =  b  is 
consistent).  On  the  other  hand,  scenarios  3  and  4  exclude  the  problems  with  non-differentiable  g.  Compared 
to  scenario  3,  scenario  4  adds  the  strong  convexity  assumption  on  g  and  drops  the  remark  on  the  full  column 
rank  of  B. 

Under  scenario  1  with  Q  y  0  and  scenario  3,  the  remarks  in  Table  1.1  are  needed  essentially  because  yk 

1Suppose  a  sequence  {uk  }  converges  to  u* .  We  say  the  convergence  is  (in  some  norm  ||  ■  ||) 

•  Q-linear ,  if  there  exists  //  £  (0, 1)  such  that  ^  <  //.; 

'  \  '  /  ||iak—  u*  ||  — 

•  R-linear ,  if  there  exists  a  sequence  {crfc}  such  that  || uk  —  u*||  <  crk  and  ak  — >  0  Q-linearly. 
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gets  coupled  with  xk  and  Xk  in  certain  inequalities  in  our  convergence  analysis.  The  full  column  rank  of  B 
helps  bound  the  error  of  yk  by  those  of  xk  and  Xk. 

Four  cases.  When  P  =  0  (corresponds  to  exactly  solving  the  ADM  x-subproblem),  we  have  P  A  0  and 
only  obtain  linear  convergence  in  Ax.  However,  when  P  0,  linear  convergence  in  x  is  obtained. 

When  Q  =  0  (corresponds  to  exactly  solving  the  ADM  y-subproblem) ,  y  is  not  part  of  the  Q-linear 
convergent  joint  variable.  But,  when  Q  >~  0,  y  becomes  part  of  it. 

1.4.  The  penalty  parameter  j3.  It  is  well  known  that  the  penalty  parameter  j3  can  significantly  affect 
the  speed  of  the  ADM.  Since  the  rate  of  convergence  developed  in  this  paper  is  a  function  of  /3,  the  rate  can 
be  optimized  over  We  give  some  examples  in  Section  3.2  below,  which  shows  the  rate  of  convergence  is 
positively  related  to  the  strong  convexity  constant  of  /  and  g ,  while  being  negatively  related  to  the  Lipschitz 
constant  of  V/  and  Vg  as  well  as  the  condition  number  of  A,  B  and  [A,  B\.  More  analysis  and  numerical 
simulations  are  left  as  future  research. 

1.5.  Notation.  We  let  (•,  •)  denote  the  standard  inner  product  and  ||  ■  ||  denote  the  £2-norm  ||  •  ||2  (the 
Euclidean  norm  of  a  vector  or  the  spectral  norm  of  a  matrix).  In  addition,  we  use  Amin(A/)  and  Amax(M) 
for  the  smallest  and  largest  eigenvalues  of  a  symmetric  matrix  M,  respectively. 

A  function  /  :  Rra  — »  R  U  {+oo}  is  called  strongly  convex  with  constant  v  >  0  if  for  all  x\,x2  £  R"  and 
all  t  £  [0, 1], 

f(tx  1  +  (1  -  t)x2)  <  tf{x i)  +  (1  -  t)f(x2)  -  *uf(l  -  t)||ari  -  x2\\2.  (1.7) 

The  gradient  V/  is  called  Lipschitz  continuous  with  constant  Lf  if  for  all  x\,x2  £  R™, 

llV/^i)  —  V/(x2)||  <  Lf\\xi  —  x2\\.  (1.8) 

1.6.  Assumptions.  Throughout  the  paper,  we  make  the  following  standard  assumptions. 
Assumption  1.  There  exists  a  saddle  point  u*  :=  (x*,y*,X*)  to  problem  (1.1),  namely,  x* ,  y* ,  and  A* 

satisfy  the  KKT  conditions: 

BTX*  G  dg(y*),  (1.9) 

ATX*  G  df(x*),  (1.10) 

Ax*  +  By*  —  b  =0.  (1.11) 

When  assumption  1  fails  to  hold,  the  ADM  method  has  either  unsolvable  or  unbounded  subproblems  or  a 
diverging  sequence  of  Xk.  The  optimality  conditions  of  the  subproblems  of  Algorithm  2  are 

Bt A  -  pBTA{xk  -  xk+1)  +  Q(yk  -  yk+1)  £  dg(yk+1 ),  (1.12) 

ATX  +  P(xk  -xk+1)  £  df{xk+l).  (1.13) 

For  notation  simplicity,  we  introduce 

A  :=  Xk  -  /3{Axk+1  +  Byk+1  -  b).  (1.14) 

If  7  =  1,  then  A  =  Afc+1;  otherwise, 

A  -  Afe+1  =  (7  -  1  )/3(Axk+1  +  Byk+1  -  b)  =  (1  -  ^-)(Afc  -  Afc+1).  (1.15) 

A 

This  relation  between  A  and  Afc+1  is  used  frequently  in  our  analysis. 
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Assumption  2.  Functions  f  and  g  are  convex.  We  define  scalars  Vf  and  vg  as  the  strong  convexity 
constants  of  /  and  g ,  respectively.  Following  from  (1.7),  they  satisfy 

(si  -  s2,  ad  -  x2)  >  Vf  ||aq  -  a;2||2,  Vxi,  x2,  «i  G  df(x i),  s2  G  df(x2),  (1.16) 

{h -t2,yi -y2)  >  vg\\yi -y2\\2,  Vyi,  y2,  ti  G  dg{yi),  t2  G  dg(y2).  (1.17) 


From  the  convexity  of  /  and  g ,  it  follows  that  Vf,i/g  >  0,  which  are  used  throughout  Section  2.  They 
are  strictly  positive  if  the  functions  are  strongly  convex.  To  show  linear  convergence,  Section  3  uses  Vf  >  0 
and,  for  scenarios  3  and  4,  vg  >  0  as  well. 

1.7.  Organization.  The  rest  of  the  paper  is  organized  as  follows.  Section  2  shows  the  convergence  of 
Algorithm  2  under  mild  assumptions.  Then  Section  3,  under  the  assumptions  in  Table  1.1,  proves  the  global 
linear  convergence  of  Algorithms  2.  Section  4  discusses  several  interesting  applications  that  are  covered  by 
our  linear  convergence  theory.  Finally,  Section  5  presents  some  preliminary  numerical  results  to  demonstrate 
the  linear  convergence  behavior  of  Algorithm  2. 


2.  Global  convergence.  In  this  section,  we  show  the  convergence  of  Algorithm  2.  The  proof  steps 
are  similar  to  the  existing  ADM  convergence  theory  in  [21,22]  but  are  adapted  to  Algorithm  2.  Several 
inequalities  in  the  section  are  used  in  the  linear  convergence  analysis  in  the  next  section. 

2.1.  Convergence  analysis.  For  notation  simplicity,  we  introduce 


u 


*  . 


/  xfc+lN 


V 

V  * 


fe+1 


for  k  =  0, 1, . . . , 


where  u*  is  a  KKT  point,  uk  is  the  current  point,  and  u  is  the  next  point  as  if  j  =  1,  and 


Go  I  Im  |  ,  Gt 
7  h, 


IP 


\ 


P 


Q 


(_T0  ^1  — 


Q 


(2.1) 


1/ 
p  PJ 


J  T  , 
01  P) 


where  we  recall  P  =  P  +  (3AT A.  From  these  definitions  it  follows 


uk+1  =uk-G  o(uk-u). 


(2.2) 


We  choose  P ,  Q  and  (3  such  that  P  F  0  and  Q  d  0.  Hence  G  A  0  and  ||  •  ||g  is  a  (semi-)norm.  The  analysis 
is  based  on  bounding  the  error  ||wfc  —  m*||g  and  estimate  its  decrease. 

Lemma  2.1.  Under  assumptions  1  and  2,  the  sequence  {ufc}  of  algorithm  2  obeys 


\\uk  —  u* 


\% 


.fc+1 


~  u*\\g  —  h(uk  —  v)  +  2^/| 


k+ 1 


+  2ug\\ yk+1~y* 


(2.3) 


where 


h(uk  -  u)  =  h{xk  -  xk+\  yk  -  yk+1,Xk  -  A) 


k  _  „k+ il|2  +  || yk  _  yk+ 1||2  +  ‘2X-^L\\\k  -  A||2  +  2(Afc  -  A )TA(xk  -  xk+1). 


=  \\x  —  X 


Proof.  By  the  convexity  of  g  and  the  optimality  conditions  (1.9)  and  (1.12),  it  follows  that 


(yk+1  -  y\  Bt  (A  -  A*  -  f3A(xk  -  xk+ '))  +  Q(yk  - 


yk+1))  >  vg\\yk+1 


(2.4) 


Similarly,  by  the  convexity  of  /  and  the  optimality  conditions  (1.10)  and  (1.13),  we  have 


(xk+1  -  x*,  At  (A  -  A*  -  /M(xfc  -  xfe+1))  +  P(xk  -  xk+1))  >  vf\\xk+1  -  ir*||2. 

In  addition,  it  follows  from  (1.11)  and  (1.14)  that 

A(xk+1  -  x*)  +  B(yk+1  -  y *)  =  h Afc  -  A). 

Then,  adding  (2.4)  and  (2.5)  and  using  (2.6)  give 

*  (Afc  -  A,  A  -  A*  —  (3A(xk  -  xk+1))  +  (xk+1  -  x*,  P(xk  -  xk+1))  +  (yk+1  -  y*,  Q(yfc  -  yk+1)) 
>vf\\xk+1-x*r  +  vg\\yk+1-y*\\2, 
which  can  be  simplified  as 

(u  -  u*)TGr(txfe  -  u)  >  (A(xk  -  xk+1),  Xk-X)  +  uf\\xk+1  -  a;*||2  +  Vg\\ yk+1  -  y*  ||2. 

By  rearranging  the  terms,  we  have 

(uk  -  u*)TG^{uk  -  u)  >  ||«fe  -  ufGl  +  {A{xk  -  xk+1),Xk  —  A)  +  Vf\\xk+1  -  rr*||2  +  vg\\ yk+1  -  y*f. 
From  the  identity  ||a  —  c||G  —  ||6  —  c||G  =  2 (a  —  c)TG(a  —  b)  —  ||a  —  b\\G  and  (2.2),  it  follows  that 
II Uk  -  u*\\l  -  \\uk+1  -  u*fG  =  2 (uk  -  tz*)tGi(t/  -u)-  \\G0(uk  -  u)IIg- 


(2.5) 

(2.6) 

(2.7) 

(2.8) 

(2.9) 

(2.10) 


By  (2.9),  we  have 

\\uk  ~  u*\\l  -  \\uk+1  -  u*\\l  >  2|| uk  -  u\\2Gi  ||«fc  -  H||2GlGo  +  2 (A(xk  Xk+1),xk  ~  A) 

+  2vf\\xk+1  -  xl2  +  2vg\\yk+1  -  y*||2,  (2.11) 

and  thus  (2.3)  follows.  □ 

In  the  next  theorem,  we  bound  h(uk  —  u)  from  zero  by  applying  the  Cauchy-Schwarz  inequality  to  its 
cross  term  2(Afc  —  X)T A{xk  —  xk+l).  If  P  =  0,  a  more  refined  bound  is  obtained  to  give  7  a  wider  range  of 
convergence. 

Theorem  2.2.  Assume  assumptions  1  and  2.  (1)  When  P  /  0,  if  7  obeys 

(2  -  i)P  >-  (7  -  1)/3AtA  (2.12) 


then  there  exists  77  >  0  such  that 

II uk  u* \\l  -  ||ufc+1  -  u*\\l  >  y\\uk  -  uk+1\\l  +  2vf\\xk+1  -  x*||2  +  2^||yfc+1  -  y*||2. 
(2)  When  P  =  0,  if 


7<£  (0, -(1  +  \/5)), 


then  there  exist  77  >  0  such  that 


I u  -u  ||G 
,k 


P, 


*112  ,  —  II  II 2 


P 

>  77||ufc  -  ufc+1||^  +  2^11^  -  xk+kf  +  2vf\\xk+1  -  a;*||2  +  2^||yfe+^  -  y 


-  (J|t/+1  -  U 
,k  ^k-P  1 1 1 2 


*  II 2 


fc+1 1|  2 


p 

k+ 1 


11 2 


(2.13) 


(2.14) 


(2.15) 


where 


rk  :=  Axk  +  Byk  —  b 


is  the  residual  at  iteration  k. 

If  we  set  7  =  1,  then  we  have 

\\uk  —  u*\\q  —  \\uk+1  —u*\\g  >  r)\\uk  ~uk+1\\Q+2i'f\\xk  —  xk+1\\2  +2vf\\xk+1  —x*\\2  +2vg\\yk+1 


y* II2,  (2.16) 


where  r)  =  1. 

Proof. 

(1)  By  the  Cauchy-Schwarz  inequality,  we  have 

2(Xk  -  X )TA(xk  -  xk+1)  >  -  1 II A(xk  -  xk+1)\\2  -  p\\Xk  -  A||2,  Vp  >  0.  (2.17) 

P 

Substituting  (2.17)  into  (2.3)  and  using  A(Afc  —  Afc+1)  =  Afc  —  A,  we  have 

\\uk-u*wi-\\uk+i-u*rG 

>  \\xk  xk+%_,ATA  +  II yk  -  yk+%  +  -  p)  ^l|Afe  -  Afc+1||2  (2.18) 

+  2vf\\xk+1  -  x*||2  +  2i/g\\yk+1  -  j/*||2,  Vp  >  0. 

To  show  that  (2.13)  holds  for  a  certain  p  >  0,  we  only  need  P  —  j;ATA  y  0  and  —  p  >  0  for  a  certain 
p  >  0,  which  is  true  if  and  only  if  we  have  P  y  AT A  or,  equivalently,  (2.12). 

(2)  For  P  =  0,  we  first  derive  a  lower  bound  for  the  cross  term  ( Xk  —  X)TA(xk  —  xk+1).  Applying  (1.13)  at 
two  consecutive  iterations  with  P  =  0  and  in  light  of  the  definition  of  A,  we  have 

J  AT[Xk~1-p(Axk  +  Byk-b)]Gdf(xk), 

{  ATX  e  df(xk+1).  (  } 

The  difference  of  the  two  terms  on  the  left  in  (2.19)  is 

AT[Afc_1  -  0(Axk  +  Byk  —  h)  —  A]  =  AT(Afc  —  A)  —  (1  —  j)^AT(Axk  +  Byk  -  b).  (2.20) 

By  (2.19),  (2.20)  and  (1.16),  we  get 

(AT(Xk  -  X),xk  -  xk+1)  -  ((1  -  7  )pAT{Axk  +  Byk  -  b)7xk  -  xk+1 )  >  vf\\xk  -  xk+1f,  (2.21) 


to  which  applying  the  Cauchy-Schwarz  inequality  gives 
(Afe  -  X)T A{xk  -  xk+1) 

>(^P(Axk  +  Byk  -  6),  (1  -  7  )^A(xk  -  xk+1))  +  vf\\xk  -  xfc+1||2 

>  -  |^||Arfc  +  Byk  -  b||2  -  (1  ~  ^)2/3f>\\A(xk  -  xk+1)\\2  +  uf\\xk  -  zfc+1||2,  Vp  >  0.  (2.22) 

Substituting  (2.22)  into  (2.3)  and  using  P  =  P  +  f3AT A  =  /3AT A  and  the  definition  of  A,  we  have 

\\uk-u*\\2G  +  ^\\Axk  +  Byk-b\\2 
P 

>  ||wfc+1  -  u*\\2G  +  -||  ATfe+1  +  Byk+l  -  b\\2 

P  (2.23) 

+  P  (2  -  7  -  £)  II Axk+1  +  Byk+1  -  b\\2  +  /3  (1  -  (1  -  7)2p)  || A(xk  -  xfe+1)||2 
+  I \yk  ~  yk+%  +  2"f\\xk  Xk+1\\2  +  2vf\\xk+i  -  x*||2  +  2vg\\yk+1  -  y*||2. 
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To  prove  such  77  >  0  exists  for  (2.15),  we  only  need  the  existence  of  p  >  0  such  that  2  —  7  —  A  >0  and 
1  —  (1  —  7 )2p  >  0,  which  holds  if  and  only  if  2  —  7  >  (1  —  r))2  or,  equivalently,  7  £  (0,  1+2V^). 

In  this  case  of  P  =  0,  if  we  set  7  =  1,  (2.22)  reduces  to  ( Xk  —  X)TA(xk  —xk+1)  >  Vf\\xk  —  xk+1\\2 ,  which 
substituting  into  (2.3)  gives  (2.16)  with  77  =  1. 

□ 

Now  the  bounds  in  Theorem  2.2  are  used  to  give  the  global  convergence  of  Algorithm  2. 

Theorem  2.3  (Global  Convergence).  Consider  the  sequence  {uk}  generated  by  Algorithm  2.  Under 
assumptions  1  and  2  and  the  additional  assumption  that  {ufc}  is  bounded,  for  any  7  satisfying  its  conditions 
given  in  Theorem  2.2,  {zifc}  converges  to  a  KKT  point  u*  of  (1.1)  in  the  G-norm,  namely,  ||itfe  —  u*||g  — ►  0. 

Proof.  Being  bounded,  {uk}  has  a  converging  subsequence  {ufcj  }.  Let  u  =  linr^oo  ukj .  Next,  we  will 
show  u  is  a  KKT  point.  Let  u*  denote  an  arbitrary  KKT  point. 

Consider  P  /  0  first.  From  (2.13)  we  conclude  that  \\uk  —  u*\\q  is  monotonically  nonincreasing  and 
thus  converging,  and  due  to  77  >  0,  \\uk  —  uk+1\\Q  — >  0.  In  light  of  (2.1)  where  PA  0  and  Q  A  0,  we  obtain 
Xk  -  Xk+1  -A  0  or  equivalently, 

rk  =  (Axk+1  +  Byk+1  -  b) ->  0,  as  k  ->  00.  (2.24) 

Now  consider  P  =  0.  From  (2.15)  we  conclude  that  ||ufc  —  u*\\q  +  ^||7,fc||2  is  monotonically  nonincreasing 
and  thus  converging.  Due  to  77  >  0,  \\uk  —  uk+1\\2G  -A  0,  so  Xk  —  Xk+1  -A  0  and  (2.24)  holds  as  well. 
Consequently,  ||wfc  —  u*\\G  also  converges. 

Therefore,  by  passing  limit  on  (2.24)  over  the  subsequence,  we  have  for  P  =  0  or  not: 

Ax  +  By  —  6  =  0.  (2.25) 

Recall  the  optimality  conditions  (1.12)  and  (1.13): 

Bt X  -  /3BTA(xk  -  xk+1)  +  Q(yk  -  yk+1)  £  dg(yk+1), 

AT X  +  P(xk  -  xk+1)  £  df(xk+1). 

Since  ||ufe  —  'ufc+1|||:  ~ t  0,  in  light  of  the  definition  of  G  (2.1),  we  have  the  following: 

•  when  P  =  0,  A(xk  -  xk+1 )  - A  0; 

•  when  P/0,  the  condition  (2.12)  guarantees  P  A  0  and  thus  xk  —  xk+1  -A  0; 

•  since  Q  A  0,  we  obtain  Q(yk  —  yk+l)  -A  0. 

In  summary,  j3BT A(xk  —  xk+1),  Q(yk  —  yk+1),  and  P{xk  —  xk+1)  are  either  0  or  converging  to  0  in  k,  no 
matter  P  =  0  or  not. 

Now  on  both  sides  of  (1.12)  and  (1.13)  taking  limit  over  the  subsequence  and  applying  Theorem  24.4 
of  [26],  we  obtain: 

BT X  £  dg{y), 

ATX  £  df(x). 

Therefore,  together  with  (2.25),  u  satisfies  the  KKT  condition  of  (1.1). 

Since  u  is  a  KKT  point,  we  can  now  let  u*  =  u.  From  ukj  -A  u  in  j  and  the  convergence  of  ||ufc  —  u*\\G 
it  follows  || uk  -  u*\\2g  -A  0  in  k.  □ 

Remark  1.  By  the  definition  of  G,  the  convergence  ||7tfc  —  m*||q  — >  0  implies  the  following: 

(a)  Xk  -A  A*,  regardless  of  the  choice  of  P  and  Q; 

(b)  when  P  ^  0,  condition  (2.12)  guarantees  P  A  0  and  thus  xk  -A  x* ;  when  P  =  0,  Axk  -A  Ax* ; 

(c)  when  Q  A  0,  yk  — >  y* ;  otherwise ,  Byk  -A  By*  following  from  (2.24)  and  (2.25). 


(2.26) 

(2.27) 
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Remark  2.  Let  us  discuss  the  conditions  on  7.  If  P  0,  the  condition  (2.12)  is  always  he  satisfied 
for  0  <  7  <  1.  However,  in  this  case,  7  can  go  greater  than  1,  which  often  leads  to  faster  convergence  in 
practice.  If  P  )f-  0,  the  condition  (2.12)  requires  7  to  lie  in  (0,7)  where  0  <  7  <  1  depends  on  (3,  P,  and 
AT A.  A  larger  /?  would  allow  a  larger  7. 

In  particular,  when  the  x-suhproblem  is  solved  using  prox-linear  (P  =  (pi  —  f3AT A),  condition  (2.12)  is 
guarantee  by 

t||A||2+7<2.  (2.28) 


When  the  x-subproblem  is  solved  by  one-step  gradient  descent  (P 
f  is  quadratic),  a  sufficient  condition  for  (2.12)  is 


m\\2 


+ 7  <  2. 


A/  —  Hf  —  /3ATA,  where  Hf  =  V2/  and 


(2.29) 


Remark  3.  The  assumption  on  the  boundedness  of  the  sequence  {ufe}  can  be  guaranteed  by  various 
conditions.  Since  (2.13)  and  (2.15)  imply  that  || —  u*\\q  is  bounded,  {ufe}  must  be  bounded  if  P  >-  0  and 

Q  >-  0.  Furthermore,  if  P  =  0  and  Q  =  0,  we  have  the  boundedness  of  {(Axk,Xk)}  (since  || uk  —  u*\\q  is 

bounded)  and  that  of  {Byk}  by  (2.6),  so  in  this  case,  {ufc}  is  bounded  if 

(i)  matrix  A  has  full  column  rank  whenever  P  =  0;  and 

(ii)  matrix  B  has  full  column  rank  whenever  Q  =  0. 

In  addition,  the  boundedness  of  {uk}  is  guaranteed  if  the  objective  functions  are  coercive. 

3.  Global  linear  convergence.  In  this  section,  we  establish  the  global  linear  convergence  results  for 
Algorithm  2  that  are  described  in  Tables  1.1  and  1.2.  We  take  three  steps.  First,  using  (2.13)  for  P  0  and 
(2.15)  for  P  =  0,  as  well  as  the  assumptions  in  Table  1.1,  we  show  that  there  exists  S  >  0  such  that 

\\uk-  u*\\%  >(1  +  £)|kfc+1  -u*\\2c,  (3.1) 

where  u*  =  lim*,-^  uk  is  given  by  Theorem  2.3.  We  call  (3.1)  the  Q-linear  convergence  of  {uk}  in  G- 
(semi)norm.  Next,  using  (3.1)  and  the  definition  of  G,  we  obtain  the  Q-linear  convergent  quantities  in  Table 
1.2.  Finally,  the  R-linear  convergence  in  Table  2  is  established. 

3.1.  Linear  convergence  in  G- (semi) norm.  We  first  assume  7=1,  which  allows  us  to  simplify  the 
proof  presentation.  At  the  end  of  this  subsection,  we  explain  why  the  results  for  7=1  can  be  extended  to 
7  /  1  that  satisfies  the  conditions  of  Theorem  2.2.  Note  that  for  7  =  1,  we  have  (2.16)  instead  of  (2.15). 
Hence,  no  matter  P  =  0  or  P  7^  0,  both  inequalities  (2.13)  and  (2.16)  have  the  form 

\\uk-u*\\2G-\\uk+1-u*fG>C, 

where  C  stands  for  their  right-hand  sides.  To  show  (3.1),  it  is  sufficient  to  establish 

C  >6\\uk+1 -u*\\2g.  (3.2) 

The  challenge  is  that  ||wfc+1  -  u*\\2G  is  the  sum  of  ||a;fc+1  -  x*\\2p,  \\yk+1  -  y* \\2Q,  and  ^\\Xk+1  -  A*||2  but  C 

does  not  contain  terms  like  \\yk+1  —  y*  ||2  and  ||Afc+1  —  A*||2.  Therefore,  we  shall  bound  ||Afe+1  —  A*||2  and 

ll^fc+i  _  y* ||2^  from  tiie  existing  terms  in  C  or  using  the  strong  convexity  assumptions.  This  is  done  in  a 
series  of  lemmas  below. 

Lemma  3.1  (For  scenario  1,  cases  3  and  4,  and  scenario  3).  Suppose  that  B  has  full  column  rank.  For 
any  pi  >  0,  we  have 

II yk+1  -  y*f  <  Cl\\xk+1  -  x*||2  +  c2|| Afc  -  Afe+1||2, 
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(3.3) 


where  a  :=  (1  +  ^)||A||2  •  A  Jn(BTB)  >  0  and  c2  :=  (1  +  Mi) (07)  2  '  Xjn{BTB)  >  0. 

Proof.  By  (2.6),  we  have  \\B(yk+1  —  y*)\\2  =  ||^4(icfc+1  —  ic*)  —  ^(Afc  —  Afc+1)||2.  Then  apply  the  following 
inequality 

ll«  +  /2  <  (i  +  INI2  +  (1  +  Mi)IHI2,  V/12  >  0,  (3.4) 

to  its  right  hand  side.  □ 

Lemma  3.2  (For  scenarios  1  and  2).  Suppose  that  V/  is  Lipschitz  continuous  with  constant  Lf  and  A 
has  full  row  rank.  For  any  M2  >  1,  we  have 

II A  -  A*||2  <  c3||/+1  -  z*||2  +  c4||/  -  /+1||2,  (3.5) 

where  c3  :=  L2(l  -  ^)_1A“11n(AAT)  >  0  and  c4  :=  M2||/|2A~(n(A4T)  >  0. 

Proof.  By  the  optimality  conditions  (1.10)  and  (1.13)  together  with  the  Lipschitz  continuity  of  V/,  we 
have 


\\AT(X  -  A*)  +  P(xk  -  xk+1)\\2  =  ||V/(a;fe+1)  -  V/(cc*)||2  <  L2f\\xk+1  -  cc*||2.  (3.6) 


Then  apply  the  following  basic  inequality: 

II u  +  u||2  >  ^1  -  ^  ||u||2  +  (1  -  M2)||w||2,  V/12  >  o,  (3.7) 

to  the  left  hand  side  of  (3.6).  We  require  M2  >  1  so  that  (1  —  >  0.  □ 

Lemma  3.3  (For  scenarios  3  and  4).  Suppose  V /  and  V g  are  Lipschitz  continuous  and  has  full 

row  rank.  Let  c  :=  A~in([.A,  B\  [A,  B]T)  >  0.  For  any  M3  >  1  and  M4  >  0,  we  have 


II A  -  Al2  <  c5 \\xk  -  xk+1\\2  +  cell/  -  yk+%  +  c7\\xk+1  -  x * 


+  c8\\yk+k-y 


(3.8) 


where  c5  =  p,3(l  +  ^4)\\[PT,  -pATB]\\2c  >  0 ,  c6  =  p3(l  +  M4)||Q||2c  >  0,  c7  =  (1  -  A)  1L2fc  >  0,  and 

<*  =  <i-±riL*c>  o. 

Proof.  Combining  the  optimality  conditions  (1.9),  (1.10),  (1.13),  and  (1.12)  together  with  the  Lipschitz 
continuity  of  V/  and  Vg,  we  have 


(A  -  A*)  + 

P 

(xk  -  xk+1)  + 

0 

bt 

1 

to 

Q 

=l|V/(/+1)  -  V / (cc*) || 2  +  || Vg(/+1)  -  Vg(/)||2 
<L2\\xk+1  ^  x*\\2  +  L2g\\yk+1  -  y*\\2. 


(3.9) 


Similarly,  we  apply  the  basic  inequalities  (3.4)  and  (3.7)  to  its  left  hand  side.  □  With  the  above  lemmas,  we 
now  prove  the  following  main  theorem  of  this  subsection. 

Theorem  3.4  (Q-linear  convergence  of  uk  in  G-(semi)norm).  Under  the  same  assumptions  of  Theorem 
2.3  and  7  =  1,  for  all  scenarios  in  Table  1.1,  there  exists  6  >  0  such  that  (3.1)  holds. 

Proof.  Consider  the  case  of  P  =  0  and  the  corresponding  inequality  (2.16).  In  this  case  P  =  (3AT  A  F  0. 
Let  C  denote  the  right-hand  side  of  (2.16). 

Scenarios  1  and  2  (recall  in  both  scenarios,  /  is  strongly  convex,  V/  is  Lipschitz  continuous,  and  A  has 
full  row  rank).  Note  that  C  contains  the  terms  on  the  right  side  of  (3.5)  with  strictly  positive  coefficients. 
Hence,  applying  lemma  3.2  to  C,  we  can  obtain 


C>  (c9||/+1 


*||2  +  c10||/+1  -  /||2  +  cn||Afc+1  -  A* ||2)  +  (c12||/  -  /+1||^  +  c13 1| Afc  -  Afc+1||2)  (3.10) 


—  X 
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with  cg,cn  >  0,  cio  =  2 vg  >  0,  C\2  =  rj  >  0,  and  C13  =  rj/(i 3j)  >  0.  We  have  eg  >  0  because  only  a  fraction 
of  2i/f\\xk+1  —  x*\\2  is  used  with  lemma  3.2;  c$\\xk+1  —  x*\\2  is  unused  so  it  stays.  The  same  principle  is 
applied  below  to  get  strictly  positive  coefficients,  and  we  do  not  re-state  it.  For  proof  brevity,  we  do  not 
necessarily  specify  the  values  of  Cj. 

For  scenario  1  with  Q  =  0,  ||wfc+1  —  it*|| 2G  =  ||xfe+1  —  x*\\2p  +  ^;||Afe+1  —  A*||2.  Since  ||a;fe+1  —  x*\\2  > 
Amax(P)_1|kfe+1  -  x*\\2p,  (3.2)  follows  from  (3.10)  with  S  =  min{c9A“^x(P), Cn/Fy}  >  0. 

For  scenario  1  with  Q  >-  0,  ||ufc+1  —  m*|| 2g  =  ||a;fc+1  —  x*\\2p  +  || yk+1  —  x*\\q  +  ^||Afc+1  —  A* || 2 .  Since  Ci0 
is  not  necessarily  strictly  positive,  we  shall  apply  Lemma  3.1  to  (3.10)  and  obtain 

C  >  (cu\\xk+1  -  x*\\2  +  c15|| yk+1  -  y*\\2  +  cn||Afc+1  -  A*||2)  +  c12\\yk  -  yk+%  (3.11) 

where  C14,  C15,  Cn,  C12  >  0.  So,  it  leads  to  (3.1)  with  5  =  nunlc^A^/P),  ci5Amix((9))  cii/^t}  >  0. 

Scenario  2  (recall  it  is  scenario  1  plus  that  g  is  strongly  convex).  We  have  cio  =  2vg  >  0  in  (3.10),  which 
gives  (3.1)  with  S  =  min{cgA“^x(P),  Ci0A“^x(Q),  Cn/Jy}  >  0.  Note  that  we  have  used  the  convention  that  if 
Q  =  0,  then  A“^X(Q)  =  00. 

Scenario  3  (recall  /  is  strongly  convex,  both  V/  and  V g  are  Lipschitz  continuous,  [A,  B]  has  full  row 
rank).  We  apply  lemma  3.1  to  get  ||yfc+1  —  y*\\2  with  which  we  then  apply  lemma  3.3  to  obtain 

C  >  c16\\xk+1  -  ®*||2  +  c17\\yk+1  -  /||2  +  c18||Afc+1  -  A*||2,  (3.12) 

where  Ci6,Ci7,Ci8  >  0  and  the  terms  ||xfc  —  £fe+1||2,  \\yk  —  yfc+1||2,  and  ||Afc  —  Afc+1||2  with  nonnegative 
coefficients  have  been  dropped  from  the  right-hand  side  of  (3.12).  From  (3.12),  we  obtain  (3.1)  with  <5  = 
min{ci6A-ix(P),Ci7A-ix(Q),ci8^7}  >  0. 

Scenario  f  (recall  it  is  scenario  3  plus  that  g  is  strongly  convex).  Since  Cn  =  2vg  >  0  in  (3.10),  we  can 
directly  lemma  3.3  to  get  (3.1)  with  S  >  0  in  a  way  similar  to  scenario  3. 

Now  consider  the  case  of  P  7I  0  and  the  corresponding  inequality  (2.11).  Inequalities  (2.11)  and  (2.16) 
are  similar  except  (2.16)  has  the  extra  term  ||a;fe  —  xfe+1||2  with  a  strictly  positive  coefficient  in  its  right-hand 
side.  This  term  is  needed  when  lemma  3.2  is  applied.  However,  the  assumptions  of  the  theorem  ensure  P  >-  0 

whenever  P  /  0.  Therefore,  in  (2.11),  the  term  ||ufe  —  uk+1\\Q,  which  contains  \\xk  —  xk+1\\2p,  can  spare  out 

a  term  ci9||irfe  —  xfe+1||2  with  cig  >  0.  Therefore,  following  the  same  arguments  for  the  case  of  P  =  0,  we  get 
(3.1)  with  certain  <5  >  0.  □ 

Now  we  extend  the  result  in  Theorem  3.4  (for  7  =  1)  to  7  7I  1  in  the  following  theorem. 

Theorem  3.5.  Under  the  same  assumptions  of  Theorem  2.3  and  7  7^  1,  for  all  scenarios  in  Table  1.1, 

1.  if  P  7^  0,  there  exists  6  >  0  such  that  (3.1)  holds; 

2.  if  P  =  0,  there  exists  6  >  0  such  that 

I \nk  -  u*\\2g  +  ^\\rk\\2  >  (1  +  S)  (j|«fc+1  -  u*\\2g  +  ^\\rk+1\\2)  ■  (3-13) 

Proof  When  7/  1,  which  causes  Afc+1  7^  A.  We  shall  bound  ||Afc+1  —  A*||2  but  lemmas  3.2  and  3.3  only 
give  bounds  on  ||A  —  A*||2.  Noticing  that  (A  —  A*)  —  (Afc+1  —  A*)  =  A  —  Afc+1  =  (7  —  l)rfe+1  and  C  contains 
a  strictly  positive  term  in  ||Afc  —  Afe+1||2  =  72||rfc+1||2,  we  can  bound  ||Afe+1  —  A*||2  by  a  positively  weighted 
sum  of  || A  -  A*||2  and  ||Afc  -  Afc+1||2. 

If  P  7!  0,  the  rest  of  the  proof  follows  from  that  of  Theorem  3.4. 

If  P  =  0,  7  7^  1  leads  to  (2.15),  which  extends  ||m*  —  m*||q  in  (2.16)  to  ||wl  —  m*||q  +  ^||ri||2,  for  i  =  k,k  + 1. 
Since  C  contains  ||Afc  —  Afc+1||2  =  72||rfe+1||2  with  a  strictly  positive  coefficient,  one  obtains  (3.13)  by  using 
this  term  and  following  the  proof  of  Theorem  3.4.  □ 
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3.2.  Explicit  formulas  of  the  linear  rate.  To  keep  the  proof  of  Theorem  3.4  easy  to  follow,  we 
have  avoided  giving  the  explicit  formulas  of  Cj’s  and  thus  also  those  of  S.  To  give  the  reader  an  idea  what 
quantities  affect  6,  we  discuss  the  value  of  S  for  case  1  of  different  scenarios  with  7  =  1. 

Scenario  1  and  2: 


S  =  2  vf 


mu2 


T 2 

Lf 


p\min(AAT) 


(3.14) 


Since  it  is  better  to  have  larger  6,  we  can  choose  B  = - 'H-  —  and  obtain 

6  H  \\A\\^UAA^) 


4in  v 


KAnf 


(3.15) 


where  nA  ■=  \J Amax  ( AAT ) /  Amin  ( AAT )  is  the  condition  number  of  matrix  A,  and  Kf  =  Lf/vf  is  the  condition 
number  of  function  /.  Not  surprisingly,  the  convergence  rate  is  negatively  affected  by  the  condition  numbers 
of  A  and  /. 

Scenario  3: 


S  =  min 


_ 2 fivf _  /32Xmin(ATA)  +  2/3vf  1  | 

/32||H||2 +  c7 +cic8’  c5  ’  c2c8  J  ' 


(3.16) 


The  formulas  of  cfs  are  given  in  the  previous  subsection,  which  involve  some  arbitrary  constants  y.\  >  0  and 
p-3  >  1  (/Z4  =  00  here  since  Q  =  0).  Therefore,  we  can  maximize  <5  over  /q  >  0  and  /L/,3  >  1  as  well  as  the 
parameter  /?  >  0  to  get  better  rate. 

Scenario  4  ■ 


S  =  min 


2/3isf  /32\min(ATA)  +  2/3vf  2/3vg\ 

^2||A||2  +  C7’  C5  ’  C8  J- 


(3.17) 


Similarly,  the  value  of  6  can  be  optimized  over  g  1  >  0  and  j3  >  0. 

The  formulas  of  5  for  scenarios  3  and  4  appear  to  be  more  complicated  than  the  nice  formula  (3.15)  for 
scenarios  1  and  2.  However,  a  close  look  at  these  formulas  reveals  that  the  convergence  rate  is  negatively 
affected  by  the  condition  numbers  of  the  constraint  matrices  A ,  B  and  [A,B],  as  well  as  the  condition 
numbers  of  the  objective  functions  /  and  g. 

Due  to  page  limit,  we  leave  other  cases  and  further  analysis  to  future  research. 


3.3.  Q-linear  convergent  quantities.  From  the  definition  of  G,  which  depends  on  P  and  Q ,  it  is 
easy  to  see  that  the  Q-linear  convergence  of  uk  =  (x k;yk;Ak)  translates  to  the  Q-linear  convergence  results 
in  Table  1.2.  For  example,  in  case  1  (P  =  0  and  Q  =  0),  ||ufc+1  -  u*fG  =  ||a;fc+1  -  x*\\2p  +  ^||Afe+1  -  A*||2, 
where  P  =  P  +  [3AAr  =  PATA.  Hence,  ( Axk ,  Xk)  converges  Q-linearly.  Examining  ||ufc+1  —  u*\\q  gives  the 
results  for  cases  2,  3,  4. 


3.4.  R-linear  convergent  quantities.  By  the  definition  of  R-linear  convergence,  any  part  of  a  Q- 
linear  convergent  quantity  converges  R-linearly.  For  example,  in  case  1  (P  =  0  and  Q  =  0),  the  Q-linear 
convergence  of  (Axk,Xk)  in  Table  1.2  gives  the  R-linear  convergence  of  Axk  and  Xk.  Therefore,  to  establish 
Table  1.2,  it  remains  to  show  the  R-linear  convergence  of  xk  in  cases  1  and  3  and  that  of  yk  in  cases  1  and 
2.  Our  approach  is  to  bound  their  errors  by  existing  R-linear  convergent  quantities. 

Theorem  3.6  (R-linear  convergence).  The  following  statements  hold. 

1.  In  cases  1  and  3,  if  Xk  converges  R-linearly,  then  xk  converges  R-linearly. 

2.  In  cases  1  and  2,  scenario  1,  if  Xk  and  xk  both  converge  R-linearly,  then  Byk  converges  R-linearly. 

In  addition,  if  B  has  full  column  rank,  then  yk  converges  R-linearly. 
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3.  In  cases  1  and  2,  scenarios  2~4,  if  Xk  and  xk  both  converge  R-linearly,  then  yk  converges  R-linearly. 
Proof.  We  only  show  the  result  for  7  =  1  (thus  A  =  Afc+1);  for  7  7^  1  (thus  A  7^  Afe+1),  the  results  follow 
from  those  for  7  =  1  and  the  R-linear  convergence  of  ||A  —  Afc+1||2,  which  itself  follows  from  (1.15)  and  the 
R-linear  convergence  of  Xk  (thus  that  of  Xk  —  Afc+1). 

1.  By  (2.5)  and  P  =  f3ATA ,  we  have  Vf\\xk+1  —  x*||2  <  ||A||||xfe+1  —  x*||||Afc+1  —  A*||,  which  implies 

||a;fe+1  -  x* || 2  <  ^f||Afc+1  -  A*||2.  (3.18) 

vf 

2.  The  result  follows  from  (2.6). 

3.  Scenario  3  assumes  the  full  column  rank  of  B  (or  uses  Algorithm  ??),  so  the  result  follows  from  (2.6).  In 
scenarios  2  and  4,  g  is  strongly  convex.  Recall  (2.4)  with  A  =  Afc+1: 

(yk+1  -  y\  Bt  (Afc+1  -  A*  -  pA{xk  -  xk+1))  +  Q{yk  -  yk+1))  >  vg\\ yk+1  -  y* ||2.  (3.19) 

By  the  Cauchy-Schwarz  inequality  and  Q  =  0,  we  have 

vg\\ yk+1  -  y*\\  <  ||B||||Afc+1  -  A*  -  (3A(xk  -  xfe+1)||.  (3.20) 

Therefore,  the  result  follows  from  the  R-linear  convergence  of  xk  and  Xk.  □ 

4.  Applications.  This  section  describes  several  well-known  optimization  models  on  which  Algorithm 
2  not  only  enjoys  global  linear  convergence  but  also  often  has  easy-to-solve  subproblems. 

4.1.  Convex  regularization.  The  following  convex  regularization  model  has  been  widely  used  in 
various  applications: 


min  f{By  -  b)  +  g(y)  (4.1) 

v 

where  /  is  often  a  strongly  convex  function  with  Lipschitz  continuous  gradient,  and  g  is  a  convex  function 
which  is  very  versatile  across  different  applications.  In  particular  g  can  be  nonsmooth  (e.g.,  projection  to 
a  convex  set,  l-\ -norm) .  Here,  /  and  g  are  often  referred  to  as  the  loss  (or  data  fidelity)  function  and  the 
regularization  function,  respectively.  Model  (4.1)  can  be  reformulated  to 

mmf(x)  +  g(y),  s.t.  x  +  By  =  b  (4.2) 

and  be  solved  by  Algorithm  2.  With  many  popular  choices  of  /  and  g  and  also  with  proper  P  and  Q ,  the 
x-  and  j/-subproblems  are  easy  to  solve.  If  B  has  full  column  rank  or  g  is  strongly  convex,  then  Algorithm  2 
converges  at  a  global  linear  rate. 

4.2.  Sparse  optimization.  In  recent  years,  the  problem  of  recovering  sparse  vectors  and  low-rank 
matrices  has  received  tremendous  attention  from  researchers  and  engineers,  particularly  those  in  the  areas 
of  compressive  sensing,  machine  learning,  and  statistics. 

Elastic  net  (augmented  Ii)  model.  To  recover  a  sparse  vector  y°  £  R”  from  linear  measurements 
b  =  By0  £  Rm,  the  elastic  net  model  solves 

min  \\y\\i  +  a\\y\\2  +  ^-\\Ay  -  b\\2 ,  (4.3) 

v  2/i 

where  A  £  Rmxn,  a  >  0  and  \x  >  0  are  parameters,  and  the  I\  norm  ||j/||i  :=  \Vi\  is  known  to  promote 
sparsity  in  the  solution.  It  has  been  shown  that  the  elastic  model  can  effectively  recover  sparse  vectors  and 
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outperform  Lasso  ( a  =  0)  on  reported  real-world  regression  problems  [27].  With  the  constraint  x  =  y,  (4.3) 
can  be  reformulated  as: 

min  \\y\h  +  a\\x\\2  +  ^-\\Ax  -  b\\2 

x,y  2/i  (44) 

s.t.  x  —  y  =  0. 


Augmented  nuclear-norm  model.  Similarly,  the  elastic  net  model  can  be  extended  for  recovering 
low-rank  matrices.  To  recover  a  low-rank  matrix  Y°  £  R™lXn2  from  linear  measurements  b  =  B(Y°)  £  Rm, 
the  augmented  nuclear-norm  model  solves 

min  ||y||*+a||F|||  +  iI||A(F)-6||2,  (4.5) 

where  a  >  0  and  y,  >  0  are  parameters,  A  :  RniXn2  — ►  Rm  is  a  linear  operator,  ||  ■  ||f  denotes  the  Frobenius 
norm,  and  the  nuclear  norm  ||Yj|*  denotes  the  sum  of  singular  values  of  Y  which  is  known  to  promote 
low-rankness  in  the  solution.  By  variable  splitting  X  =  Y,  (4.5)  can  be  reformulated  as: 

min||y||,+a||A|||  +  ^||A(A)-6||2 

X’Y  (4.6) 

s.t.  X-Y  =  0. 


In  (4.4)  and  (4.6),  the  functions  f(x)  =  a||x||2  +  j^\\Ax  —  b\\2  and  f{X)  =  o:||-X"||^  +  ^ |A(X)  —  6||2  are 
strongly  convex  and  have  Lipschitz  continuous  gradient;  the  functions  g(y)  :=  ||y||i  and  g(Y)  :=  ||y||*  are 
convex  and  nonsmooth.  In  fact,  ||  •  ||2  and  ||  •  |||,  can  also  be  replaced  by  many  other  choices  of  functions 
that  are  strongly  convex  and  have  Lipschitz  continuous  gradient,  or  become  so  when  restricted  to  a  bounded 
set.  Note  that  if  a  =  0  then  the  functions  /  may  not  be  strongly  convex  if  the  matrix  A  and  the  linear 
operator  A  do  not  have  full  column  rank.  In  many  applications,  this  is  indeed  the  case  since  the  number  of 
observations  of  y  and  Y  is  usually  smaller  than  their  dimensions  (i.e. ,  m  <  n  and  m  <  rii  ■  712).  However,  the 
parameter  a  >  0  guarantees  the  strong  convexity  of  /,  and  hence  the  global  linear  convergence  of  Algorithm 
2  when  applied  to  (4.4)  and  (4.6).  In  addition,  it  has  been  shown  in  [28]  that  for  most  compressive  sensing 
problems,  with  a  moderately  small  a,  problems  (4.4)  and  (4.6)  return  solutions  as  if  a  =  0. 

4.3.  Consensus  and  sharing  optimization.  Consider  in  a  network  of  N  nodes,  the  problem  of 
minimizing  the  sum  of  N  functions,  one  from  each  node,  over  a  common  variable  x.  This  problem  can  be 
written  as 


N 

min  y 

161"  2^-—' 
»  =  1 


(4.7) 


Let  each  node  i  keep  vector  Xi  £  R™  as  its  copy  of  x.  To  reach  a  consensus  among  Xi,  i  =  1, . . . ,  N,  a  common 
approach  is  to  introduce  a  global  common  variable  y  and  get 


N 


min 

{xi},y 


s.t.  Xi  —  y  =  0,  i  =  l,...,N. 


(4.8) 


This  is  the  well-known  global  consensus  problem;  see  [7]  for  a  review.  With  an  objective  function  g  on  the 
global  variable  y ,  we  have  the  global  variable  consensus  problem  with  regularization: 


N 


min 

{xi},y 


E 


fi(xi)  +g(y), 


s.t.  Xi  -  y  =  0,  i  =  l,...,N, 


(4.9) 
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where  g(y)  is  a  convex  function, 

The  following  sharing  problem  is  also  nicely  reviewed  in  [7]: 


N 


min 

{ Xi},y 


E 


fi(xi)+g 


s.t.  Xi-yi  =  0,  i  =  l,...,N, 


(4.10) 


where  s  are  local  cost  functions  and  g  is  the  shared  cost  function  by  all  the  nodes  i. 

Algorithm  2  applied  to  the  problems  (4.8),  (4.9)  and  (4.10)  converges  linearly  if  each  function  /,  is 
strongly  convex  and  has  Lipschitz  continuous  gradient.  The  resulting  ADM  is  particularly  suitable  for 
distributed  implementation,  since  the  x-subproblem  can  be  decomposed  into  N  independent  Zi-subproblems, 
and  the  update  to  the  multiplier  A  can  also  be  done  at  each  node  i. 

5.  Numerical  demonstration.  We  present  the  results  of  some  simple  numerical  tests  to  demonstrate 
the  linear  convergence  of  Algorithm  2.  The  numerical  performance  is  not  the  focus  of  this  paper  and  will  be 
investigated  more  thoroughly  in  future  research. 

5.1.  Elastic  net.  We  apply  Algorithm  2  with  P  =  0  and  Q  =  0  to  a  small  elastic  net  problem  (4.4), 
where  the  feature  matrix  A  has  m  =  250  examples  and  n  =  1000  features.  We  first  generated  the  matrix 
A  from  the  standard  Gaussian  distribution  Af(0, 1)  and  then  orthonormalized  its  rows.  A  sparse  vector 
x°  £  K"  was  generated  with  25  nonzero  entries,  each  sampled  from  the  standard  Gaussian  distribution.  The 
observation  vector  b  £  Rm  was  then  computed  by  b  =  Ax°  +  e,  where  e  ~  A/”(0, 10~3/).  We  chose  the  model 
parameters  a  =  0.1  and  yt  =  10~2,  which  we  found  to  yield  reasonable  accuracy  for  recovering  the  sparse 
solution.  We  initialized  all  the  variables  at  zero  and  set  the  algorithm  parameters  (3  =  100  and  7  =  1.  We 
ran  the  algorithm  for  200  iterations  and  recorded  the  errors  at  each  iteration  with  respect  to  a  precomputed 
reference  solution  it*. 

Figure  5.1(a)  shows  the  decreasing  behavior  of  \\uk  —  m*||q(:=  (3\\xk  —  x*\\2  +  ||Afc  —  A*||2//3)  as  the 
algorithm  progresses.  Since  variable  y  is  not  contained  in  the  G-norm,  we  also  plot  the  convergence  curve  of 
|| yk  —  y*  ||2  in  Figure  5.1(b).  We  observe  that  both  uk  and  yk  converge  at  similar  linear  rates.  In  addition, 
the  convergence  appears  to  have  different  stages.  The  later  stage  exhibits  faster  convergence  rate  than  the 
earlier  stage.  This  can  be  clearly  seen  in  Figure  5.2  which  depicts  the  Q-linear  rate  ||ztfe+1—  u*||g./||Mfe  —  u*||g. 


(a)  || uk  —  u*|| q  vs.  iteration  (b)  \\yk  —y*\\2  vs.  iteration 

Fig.  5.1.  Convergence  curves  of  ADM  for  the  elastic  net  problem. 
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Fig.  5.2.  Q-linear  convergence  rate  of  ADM  for  the  elastic  net  problem. 


Here,  the  strong  convexity  constant  of  /  is  v/  =  2a  +  Ami n(ATA)//j,  =  2a  and  the  Lipschitz  constant  of 
V/isL/  =  2a+Amax(ATA)//j  =  2a+l / /x.  By  (3.14),  our  bound  for  the  global  linear  rate  is  (l+J)^1  =  0.998, 
which  roughly  matches  the  early-stage  rate  shown  in  the  figure.  However,  our  theoretical  bound  is  rather 
conservative,  since  it  is  a  global  worst-case  bound  and  it  does  not  take  into  account  the  properties  of  the 
i\  norm  and  the  solution.  In  fact,  the  optimal  solution  x*  is  very  sparse  and  xk  will  also  become  sparse 
after  a  number  of  iterations.  Let  S  be  an  index  set  of  the  nonzero  support  of  ( xk  —  x*),  and  As  be  a 
submatrix  composed  of  those  columns  of  A  indexed  by  S.  Then,  the  constants  vj  and  Lf  in  our  bound  can 
be  effectively  replaced  by  Uf  =  2a  +  Ami n{A^As)/fi  and  Lf  =  2a  +  A max  ( A]j  As  )//x,  thereby  accounting  for 
the  faster  convergence  rate  in  the  later  stage.  For  example,  letting  S  be  the  nonzero  support  of  the  optimal 
solution  x*,  we  obtain  an  estimate  of  the  (asymptotic)  linear  rate  (1  +  <5)_1  =  0.817,  which  well  matches  the 
later-stage  rate. 


5.2. 


Distributed  Lasso.  We  consider 

min 

{zil.y 

s.t. 


solving  the  Lasso  problem  in  a  distributed  way  [29]: 

N  1 

Xi-y  =  0,  i  =  l,...,N, 


(5.1) 


which  is  an  instance  of  the  global  consensus  problem  with  regularization  (4.9). 

We  apply  Algorithm  2  with  P  =  0  and  Q  =  0  to  a  small  distributed  Lasso  problem  (5.1)  with  N  =  5, 
where  each  Ai  has  m  =  600  examples  and  n  =  500  features.  Each  Ai  is  a  tall  matrix  and  has  full  column 
rank,  yielding  a  strongly  convex  objective  function  in  Xi.  Therefore,  Algorithm  2  is  guaranteed  to  converge 
linearly. 

We  generated  the  data  similarly  as  in  the  elastic  net  test.  We  randomly  generated  each  Ai  from  the 
standard  Gaussian  distribution  Af(0, 1),  and  then  simply  scaled  its  columns  to  have  a  unit  length.  We 
generated  a  sparse  vector  x°  £  K™  with  250  nonzero  entries,  each  sampled  from  the  Af( 0, 1)  distribution. 
Each  bi  £  Rm  was  then  computed  by  bi  =  Ai x°  +  ej,  where  e,  ~  Af( 0, 10~3/).  We  chose  the  model  parameter 
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fi  =  0.1,  which  we  found  to  yield  reasonably  good  recovery  quality.  From  the  initial  point  at  zero,  we  ran 
the  algorithm  with  parameters  /3  =  10  and  7  =  1  for  50  iterations  and  computed  the  iterative  errors. 

Figure  5.3  demonstrates  the  clear  linear  convergence  behavior  of  \\uk  —  u*\\q  and  \\yk  —  y*\\2.  In  Figure 
5.4,  the  Q-linear  convergence  rate  of  ||ufe  —  u*\\q  is  depicted.  For  this  problem,  the  strong  convexity  constant 
is  Vf  =  mini { Amin (Aj  Ai)/fi}  and  the  Lipschitz  constant  is  Lf  =  maxi{Amax(Af  Ai)/fi}.  However,  the 
condition  number  Vf/Lf  in  this  test  is  relatively  big,  and  hence  the  theoretical  linear  rate  specified  by  (3.15) 
is  not  a  very  tight  bound  for  the  observed  fast  rate.  Note  that  all  x^s  tend  to  be  equal  and  become  sparse 
after  a  number  of  iterations.  Similar  to  our  previous  discussion  in  Section  5.1,  we  can  estimate  the  asymptotic 
linear  rate  by  letting  Df  =  Amj n(A^As)/{nN)  and  Lf  =  A max(4sAs)/(/r./V),  where  A  G  ^Nmxn  js  formed 
by  stacking  all  the  matrices  At  (*  =  1 , ,N),  and  S  is  an  index  set  of  the  nonzero  support  of  x* .  We 
obtained  the  asymptotic  linear  rate  to  be  (1  +  5)-1  =  0.779,  which  appears  to  be  a  much  tighter  bound. 


(a)  \\uk  —u*\\q  vs.  iteration  (b)  \\yk  —  y*  ||2  vs.  iteration 

Fig.  5.3.  Convergence  curves  of  ADM  for  the  distributed  Lasso  problem. 


Convergence  rate  of  uk 


Fig.  5.4.  Q-linear  convergence  rate  of  ADM  for  the  distributed  Lasso  problem. 
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6.  Conclusions.  In  this  paper,  we  provide  sufficient  conditions  for  the  global  linear  convergence  of  a 
general  class  of  ADMs  which  solve  subproblems  either  exactly  or  approximately  in  a  certain  manner.  Among 
the  conditions  is  a  function  that  is  strongly  convex  and  has  Lipschitz  continuous  gradient.  These  sufficient 
conditions  cover  a  wide  range  of  applications.  We  also  extend  the  existing  convergence  theory  to  allow  more 
generality  on  the  step  size  7  for  updating  the  multipliers. 

In  practice,  how  to  choose  the  penalty  parameter  /?  is  always  an  important  issue.  Our  convergence 
rate  analysis  provides  more  insights  on  how  the  penalty  parameter  (3  affects  the  convergence  speed,  thereby 
providing  some  theoretical  guidance  for  choosing  /?. 
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